Abstract

In asset-intensive services, a well-known challenge is to maintain high availability of the physical assets while keeping the total maintenance cost low. In applications of high-value machinery such as heavy industrial equipment, a traditional approach is to perform periodic maintenance according to a runtime-based schedule. Most equipment vendors publish a maintenance schedule based on a “standard” or “average” working environment. In addition, vendor maintenance schedules are commonly highly conservative, in order to reduce in-field failures that would harm the vendor’s reputation. Therefore, such a schedule may not result in satisfactory performance as measured by the owner’s business objectives. Also, the assumption of normal operating conditions may not apply in some situations. For example, stresses due to frequent overloading, continuous high-rate engine usage in tough environments, and machine usage beyond its designed capacity can contribute substantially to excessive wear and premature failures. In this paper we propose a novel computational framework to build a data-driven, economically optimized vital sign indicator for a given component type and economic criterion (e.g., average maintenance cost per unit runtime) by combining different sources of historical data such as total runtime hours, load carried, fuel consumed, and event information from sensors. This new vital sign indicator can be viewed as a transformed time scale and used to find the optimal threshold value (or “scheduled replacement time equivalent”) for a component replacement policy. Our case study was based on data collected from 50 mining haul trucks over about 6 years at one of the largest mining service companies in the world. We show that the new vital sign indicator-based replacement policy for a critical component type substantially improves on the traditional runtime-based schedule in terms of the given economic criterion, achieving a lower total maintenance cost for the enterprise.

1.  Introduction 

A traditional replacement policy for components in asset-intensive service businesses is often based on a fixed runtime-hours interval (“scheduled replacement time”) that the equipment manufacturer recommends for scheduled maintenance. This interval reflects the standard usage in an average situation assumed by the manufacturer. Most equipment vendors publish a maintenance schedule based on a “standard” or “average” working environment. In addition, vendor maintenance schedules are commonly highly conservative, in order to reduce in-field failures that would harm the vendor’s reputation. Therefore, such a schedule may not result in satisfactory performance as measured by the owner’s business objectives. Also, the assumption of normal operating conditions may not apply in some situations. For example, stresses due to frequent overloading, continuous high-rate engine usage in tough environments, and machine usage beyond its designed capacity can contribute substantially to excessive wear and premature failures.

In asset-intensive services, a well-known challenge is to maintain high availability of the physical assets while keeping the total maintenance cost low (Jardine & Tsang, 2013). Optimizing replacement decision policies based on component failure predictions has been central to condition-based predictive asset management. One of the most popular approaches models a proportional hazard function (Cox PHM) with time-dependent covariates and a Weibull baseline hazard function (Banjevic, Jardine, Makis, & Ennis, 2001; Jardine, Banjevic, Montgomery, & Pak, 2008). In practice, the hazard function modeled with this approach is not guaranteed to be monotonically increasing, and thus computing the optimal policy often involves a complicated algorithm (Wu & Ryan, 2011). Furthermore, a non-monotonic hazard function is not very intuitive and cannot be viewed as a new kind of time scale. Equipment managers would often like a monotonically increasing, time-scale-like measure for the component replacement policy. They could then use this new vital sign indicator in exactly the same way they used the runtime measure for replacement decisions.
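As a hedged illustration of why such a hazard cannot serve directly as a time scale, the Python sketch below evaluates a PHM hazard with a Weibull baseline, $h(t) = (\beta/\eta)(t/\eta)^{\beta-1}\exp(\gamma z(t))$. All parameter values and the step-shaped covariate path `covariate` are hypothetical, not taken from the case study. Even with an increasing baseline ($\beta > 1$), a drop in the time-dependent covariate $z(t)$ makes the hazard fall.

```python
import math

def weibull_phm_hazard(t: float, z: float, beta: float, eta: float, gamma: float) -> float:
    """h(t) = h0(t) * exp(gamma * z) with Weibull baseline h0(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    baseline = (beta / eta) * (t / eta) ** (beta - 1.0)
    return baseline * math.exp(gamma * z)

def covariate(t: float) -> float:
    """Hypothetical covariate path: a condition reading that drops after t = 500 hours."""
    return 2.0 if t < 500.0 else 0.5

times = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0]
hazards = [weibull_phm_hazard(t, covariate(t), beta=2.0, eta=1000.0, gamma=1.0) for t in times]

# The baseline grows with t (beta > 1), but the covariate drop at t = 500 pulls the hazard down.
is_monotone = all(a <= b for a, b in zip(hazards, hazards[1:]))
print(is_monotone)  # False
```

The cumulative hazard of the same model would still be nondecreasing, which is the property the vital sign indicator later relies on.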

In this paper we propose a novel computational framework to build a data-driven, economically optimized vital sign indicator for a given component type and economic criterion (e.g., average maintenance cost per unit runtime) by combining different sources of historical data such as total runtime hours, load carried, fuel consumed, and event information from sensors. A vital sign indicator provides a measure of the “health” of a component or piece of equipment, and can therefore support improved decision making in maintenance planning and execution, as well as production maximization. This new vital sign indicator can be viewed as a transformed time scale and used to find the optimal threshold value (or “scheduled replacement time equivalent”) for a component replacement policy. We provide an individualized maintenance plan for each component based on its actual usage. Our approach uses classification and regression techniques to estimate a hazard rate, and uses the “individualized” cumulative failure probability model to build the vital sign indicator.

Our case study was based on data collected from 50 mining haul trucks over about 6 years at one of the largest mining service companies in the world. We show that the new vital sign indicator-based replacement policy for a critical component type substantially improves on the traditional runtime-based schedule in terms of the given economic criterion, achieving a lower total maintenance cost for the enterprise.


2. Component Replacement Policies

2.1. Runtime-based Replacement Policy

Figure 1 shows an example of the failure probability density function with T* (the optimal scheduled replacement time) for a component type. Assume that a company has run a scheduled replacement policy at T*. At the time the component data are collected for our analysis, the historical list of all components of this type over a group of equipment includes running components (at the time of data collection), schedule-replaced components, and failure-replaced components. In Figure 1 each circle represents a component in the list. All blue circles before T* correspond to running components and their observed runtimes at the time of data collection. All blue circles after T* correspond to schedule-replaced components; note that in practice companies often do not adhere exactly to the replacement schedule at T*. All red circles before T* correspond to in-field failure replacements. Running and schedule-replaced components are considered “right-censored” samples in survival analysis. That is, we know that these components survived up to the time of data collection or scheduled replacement, but cannot tell when they would actually have failed.

Figure 1. An example of failure probability density function with the optimal scheduled replacement time T* (horizontal axis: runtime; red circles: failure replacements; blue circles: running or scheduled replacements)

Figure 2. An example of vital sign indicator with the optimal scheduled replacement vital sign value v* (vertical axis: vital sign indicator)

Note in Figure 1 that the standard deviation of the failure probability density function is very large; thus, there are too many in-field failure-replaced components.
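To make the effect of a large spread concrete, the sketch below compares two Weibull failure distributions with the same mean life and computes the fraction of components that fail before a fixed runtime threshold T*. The mean life, the threshold, and the two shape values are hypothetical choices for illustration; a smaller Weibull shape parameter corresponds to a wider spread.

```python
import math

def weibull_cdf(t: float, shape: float, scale: float) -> float:
    """F(t) = 1 - exp(-(t/scale)**shape)."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def scale_for_mean(mean: float, shape: float) -> float:
    """Weibull scale giving the requested mean, since mean = scale * Gamma(1 + 1/shape)."""
    return mean / math.gamma(1.0 + 1.0 / shape)

mean_life = 10000.0        # hypothetical mean runtime to failure (hours)
t_star = 0.7 * mean_life   # hypothetical runtime threshold T*

fractions = {}
for shape in (1.5, 6.0):   # smaller shape = wider spread around the same mean
    fractions[shape] = weibull_cdf(t_star, shape, scale_for_mean(mean_life, shape))
    print(f"shape={shape}: F(T*) = {fractions[shape]:.3f}")
```

With the same mean life, the wide distribution (shape 1.5) has about 39% of components failing before T*, versus about 7% for the narrow one (shape 6).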

2.2. Vital Sign-based Replacement Policy

Now we conceptually explain the development of our new vital sign indicator model. For the historical list of all components, we also have the corresponding time-stamped logs of runtime hours (meter), total fuel consumption, total work (load), and sensor events. Imagine that, for the component data and the failure probability density function shown in Figure 1, we can design a vital sign indicator (vertical axis in Figure 2) using features derived from all available information. Note that the runtime and color of each circle in Figure 2 are exactly the same as those of the corresponding circle in Figure 1, and that the color of each path (failure replacement (red), running or scheduled replacement (blue)) is based on the collected component data (i.e., the traditionally employed runtime-based replacement policy), not on the new vital sign-based replacement policy.

Then, we propose a vital sign indicator-based scheduled replacement policy that replaces components when their vital sign value reaches a threshold value v*. In Figure 2, the dotted line shows the threshold value. Each path in the runtime vs. vital sign indicator 2-dimensional plot corresponds to a component and shows its vital sign indicator profile over the runtime. Note that the runtime (the value on the horizontal axis) at the intersection point between the threshold line and the path of a component indicates the actual replacement time under the policy.
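The intersection described above can be computed directly from a sampled vital sign path. The sketch below is a minimal version under two assumptions of ours: samples are ordered by increasing runtime, and the path is linearly interpolated between samples. The function name `replacement_runtime` and the example path are hypothetical.

```python
from typing import Optional, Sequence

def replacement_runtime(runtimes: Sequence[float], vital_signs: Sequence[float],
                        threshold: float) -> Optional[float]:
    """Runtime at which a sampled vital sign path first reaches `threshold`.

    Returns None if the path never reaches the threshold (the component
    would still be running under the policy).
    """
    for i, (t, v) in enumerate(zip(runtimes, vital_signs)):
        if v >= threshold:
            if i == 0:
                return t
            t0, v0 = runtimes[i - 1], vital_signs[i - 1]
            # v0 < threshold <= v, so v - v0 > 0 and the division is safe.
            return t0 + (threshold - v0) * (t - t0) / (v - v0)
    return None

# Hypothetical path: vital sign sampled every 1,000 runtime hours.
path_t = [0.0, 1000.0, 2000.0, 3000.0]
path_v = [0.0, 0.1, 0.3, 0.7]
print(replacement_runtime(path_t, path_v, 0.5))
```

For this path, the threshold 0.5 is reached halfway between the samples at 2,000 and 3,000 hours, i.e., at about 2,500 hours.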

Keep in mind that the failure probability density function along the vital sign indicator axis depends on our model of the vital sign indicator. Intuitively, one desirable characteristic of a good vital sign indicator is a small standard deviation along the vital sign indicator axis. This contributes to a better classification, using a constant v*, between the failure-replaced components (above the v* line) and the other running/schedule-replaced components (below the v* line). In other words, if this vital sign indicator-based scheduled replacement policy had been used in the past, most of the failure-replaced components in the collected data (red circles) would have been replaced on schedule (at v*) before the actual in-field failures. However, this characteristic of the failure probability is not sufficient for a good vital sign indicator model, since the average runtime to scheduled replacements (i.e., the average of the actual runtimes at the intersection points with v*) and the average runtime to failure replacements should also be large. For this reason, we should examine the shape of the vital sign paths in the runtime vs. vital sign indicator 2-dimensional plot. We explain this in more detail using the economic optimization equations below.


3. Economic Optimization

3.1. Runtime-based Replacement Policy

Let $F(t)$ be the cumulative failure probability function at runtime $t$ (= $\Pr(T \le t)$, where $T$ is a random variable denoting the runtime at failure), and let $S(t) = 1 - F(t)$ be the survival probability function at $t$. When we deal with a dataset from real industry practice, it is very likely that there are no failure data after the scheduled replacement time the company employed during the period covered by the dataset. Therefore, we cannot reliably estimate the exact shape of the function beyond the current scheduled replacement time. In this paper we assume that the survival probability function can be estimated using a parametric Weibull fit (Fox, 2002) to the runtime and failure data.

For our economic optimization analysis, we are given the following economic and logistic parameters:

$C_f$ = in-field failure replacement cost, which includes the part and labor cost to replace the component, the cost of retrieving the equipment from the field, and lost revenue due to blocking other equipment when it fails in the field (called “circuit break”),

$C_p$ = scheduled replacement cost, which includes the part and labor cost to replace the component,

$c_d$ = cost per unit downtime of the equipment, including lost revenue that could have been contributed by that piece of equipment,

$DT_f$ = downtime due to an in-field failure,

$DT_p$ = downtime due to a scheduled replacement.

In general, in-field failure replacement cost and downtime are greater than scheduled replacement cost and downtime, respectively ($C_f > C_p$, $DT_f > DT_p$).

Denote by $t_p$ the scheduled replacement time for the policy, which is our optimization target. With this scheduled replacement policy, the mean time to a failure replacement (which happens before $t_p$) is denoted by $t_f$ and estimated as:

$$t_f = t_p - \frac{\int_0^{t_p} F(t)\,dt}{F(t_p)}$$

A new component lifetime cycle starts at the installation time of a component. The component may be replaced due to an in-field failure or a scheduled replacement, finishing its lifetime cycle.

For a runtime-based replacement policy, we choose $t_p$ to minimize the average maintenance cost per unit runtime. The cycle averages are:

$$\text{average total time per cycle} = (t_f + DT_f)\,F(t_p) + (t_p + DT_p)\,(1 - F(t_p))$$

$$\text{average runtime per cycle} = t_f\,F(t_p) + t_p\,(1 - F(t_p))$$

$$\begin{aligned}
\text{average maintenance cost per unit runtime}
&= \frac{\text{average maintenance cost per cycle}}{\text{average runtime per cycle}} \\
&= \frac{\text{average failure replacement cost per cycle} + \text{average scheduled replacement cost per cycle}}{\text{average runtime per cycle}} \\
&= \frac{(C_f + c_d\,DT_f)\,F(t_p) + (C_p + c_d\,DT_p)\,(1 - F(t_p))}{t_f\,F(t_p) + t_p\,(1 - F(t_p))} \\
&= \bigl(C_f - C_p + c_d(DT_f - DT_p)\bigr)\,\frac{F(t_p)}{\lambda} + \frac{C_p + c_d\,DT_p}{\lambda}
\end{aligned}$$

where $\lambda$ = average runtime per cycle $= t_f\,F(t_p) + t_p\,(1 - F(t_p))$.

As $t_p$ (the scheduled replacement time) is set to a higher value, there is more chance of in-field failure replacements; that is, $F(t_p)$ (the total probability of in-field failure replacements) becomes larger (see Figure 3).

Figure 3. The trade-off between the average runtime per cycle and $F(t_p)$ (= total in-field failure probability) (axes: runtime vs. failure probability density function; $F(t_p)$ is the area to the left of $t_p$)

Since $DT_f > DT_p$ and $C_f > C_p$ in general, the optimization goal of minimizing the average maintenance cost per unit runtime is achieved by increasing the average runtime per cycle ($\lambda = t_f F(t_p) + t_p(1 - F(t_p))$) and decreasing the in-field failure probability per cycle, $F(t_p)$. Note that there is a trade-off between decreasing $F(t_p)$ and increasing the average runtime per cycle. In general, decreasing $F(t_p)$, which would mean fewer failure replacements, can be obtained by decreasing $t_p$, but this then reduces the average runtime per cycle. Note that $t_f < t_p$, since $t_f$ averages only failures that occur before $t_p$. Also, as $F(t_p)$ becomes smaller, $t_p$ receives more weight in the average runtime per cycle. Given $F(t)$, $C_f$, $C_p$, $c_d$, $DT_f$, and $DT_p$, the average maintenance cost per unit runtime is a function of $t_p$, denoted by $g$:

$$g(t_p) = \frac{(C_f + c_d\,DT_f)\,F(t_p) + (C_p + c_d\,DT_p)\,(1 - F(t_p))}{t_f\,F(t_p) + t_p\,(1 - F(t_p))}$$

It is important to note that the cumulative failure probability function $F(t)$ is fixed and can be estimated from the failure data of the component type we analyze. Note also that $t_f$ depends on both $F(t)$ and $t_p$. The optimized time threshold for the scheduled replacement policy is then $t_p^* = \arg\min_{t_p} g(t_p)$, since $g$ is a cost to be minimized.
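The runtime-based optimization can be sketched end to end once a Weibull fit is available: compute $t_f$ from the formula above, evaluate $g(t_p)$, and take the minimum over a grid of candidate thresholds. The Weibull parameters, cost values, and grid below are hypothetical, and the trapezoidal rule and grid search are our own simple choices, not a method prescribed by the text.

```python
import math

def weibull_cdf(t: float, shape: float, scale: float) -> float:
    return 1.0 - math.exp(-((t / scale) ** shape))

def mean_time_to_failure(tp: float, cdf, steps: int = 2000) -> float:
    """t_f = t_p - (integral_0^{t_p} F(t) dt) / F(t_p), integral by the trapezoidal rule."""
    h = tp / steps
    integral = sum(0.5 * h * (cdf(i * h) + cdf((i + 1) * h)) for i in range(steps))
    return tp - integral / cdf(tp)

def cost_rate(tp: float, cdf, c_f: float, c_p: float, c_d: float,
              dt_f: float, dt_p: float) -> float:
    """g(t_p): average maintenance cost per cycle / average runtime per cycle."""
    f = cdf(tp)
    t_f = mean_time_to_failure(tp, cdf)
    cost_per_cycle = (c_f + c_d * dt_f) * f + (c_p + c_d * dt_p) * (1.0 - f)
    runtime_per_cycle = t_f * f + tp * (1.0 - f)   # lambda in the text
    return cost_per_cycle / runtime_per_cycle

# Hypothetical Weibull fit and cost parameters, for illustration only.
def cdf(t: float) -> float:
    return weibull_cdf(t, shape=3.0, scale=12000.0)

params = dict(c_f=50000.0, c_p=20000.0, c_d=500.0, dt_f=48.0, dt_p=8.0)

grid = [500.0 * k for k in range(2, 41)]          # candidate t_p: 1,000 to 20,000 hours
costs = [cost_rate(tp, cdf, **params) for tp in grid]
t_star = grid[costs.index(min(costs))]            # t*_p = argmin of g over the grid
print(t_star, min(costs))
```

A grid search keeps the sketch transparent; a scalar minimizer would work equally well because $g$ is a smooth function of $t_p$ here.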


3.2.  Vital  Sign-based  Replacement  Policy 

Let $v$ be the vital sign indicator, let $F(v)$ be the cumulative failure probability function at vital sign $v$ (= $\Pr(V \le v)$, where $V$ is a random variable denoting the vital sign at failure), and let $S(v) = 1 - F(v)$ be the survival probability function at $v$. We estimate this survival probability function by a local regression (loess) on the Kaplan-Meier (KM) estimate (Therneau, 2000), using the vital sign and failure data.
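The KM step can be sketched in a few lines of Python, treating the vital sign value at failure or censoring as the time scale. This is a minimal sketch of the standard estimator only; the subsequent loess smoothing is not shown, and the function name `kaplan_meier` and the example data are hypothetical. (Libraries such as lifelines provide a `KaplanMeierFitter` class for the same estimator.)

```python
from typing import List, Sequence, Tuple

def kaplan_meier(values: Sequence[float], failed: Sequence[bool]) -> List[Tuple[float, float]]:
    """Kaplan-Meier estimate of S at each distinct failure value.

    `values` are the time-scale values (here, vital signs) at failure or censoring;
    `failed` is False for right-censored components (running or schedule-replaced).
    Components censored at a value are still counted as at risk at that value.
    """
    data = sorted(zip(values, failed))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        x = data[i][0]
        deaths = censored = 0
        while i < len(data) and data[i][0] == x:   # group ties at value x
            if data[i][1]:
                deaths += 1
            else:
                censored += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((x, survival))
        at_risk -= deaths + censored
    return curve

# Hypothetical vital sign values at failure (True) or censoring (False).
curve = kaplan_meier([0.1, 0.2, 0.2, 0.3, 0.4, 0.5], [True, True, False, True, False, True])
print(curve)
```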

Denote by $v_p$ the vital sign threshold value for scheduled replacements under the vital sign-based scheduled replacement policy, which is our optimization target. Then $F(v_p)$ is the total expected probability of failure replacements, and $1 - F(v_p)$ is the total expected probability of scheduled replacements. Under this policy, the expected time to a scheduled replacement at $v_p$ is denoted by $t_p$, and the expected time to a failure replacement is denoted by $t_f$. In this paper we estimate $t_p$ and $t_f$ under one of two assumptions, described below.

Let $\mathrm{Comp}[v \ge v_p]$ denote the set of all components whose vital sign value reaches $v_p$ in the dataset, and let $\mathrm{Comp}[v < v_p]$ denote the set of all components whose vital sign value satisfies $v < v_p$ for all times $t$ in the dataset.

Let $P[v \ge v_p]$ denote the actual ratio of the number of components in $\mathrm{Comp}[v \ge v_p]$ to the total number of components in the dataset. This ratio is equal to or smaller than $1 - F(v_p)$ (the total expected probability of scheduled replacement), since the total expected probability takes right-censored components (running at the time of data collection) into account: some running components would fail with $v \ge v_p$. We assume that those components account for the difference between the expected probability and the actual ratio ($= 1 - F(v_p) - P[v \ge v_p]$), and that they are schedule-replaced at $v_p$ with the cumulative probability function of the replacement time $F_{v<v_p}(t) = 1 - S_{v<v_p}(t)$, where $S_{v<v_p}(t)$ is the survival probability function estimated using a Weibull fit to the runtime and failure data of $\mathrm{Comp}[v < v_p]$. In other words, we assume that $F_{v<v_p}(t)$, estimated from $\mathrm{Comp}[v < v_p]$, applies uniformly over the whole range $v < v_p$. Thus, the mean scheduled replacement time over the components corresponding to $1 - F(v_p) - P[v \ge v_p]$ is the same as the mean failure time over $\mathrm{Comp}[v < v_p]$, which is denoted by $\tau$:

$$\tau = \int_0^{\infty} S_{v<v_p}(t)\,dt.$$

Thus,

$$t_f = \text{expected time to failure replacement} = \int_0^{\infty} S_{v<v_p}(t)\,dt = \tau,$$

$$t_p = \text{expected time to scheduled replacement} = \frac{P[v \ge v_p]\;E\bigl[t \mid v = v_p \text{ for } \mathrm{Comp}[v \ge v_p]\bigr] + \bigl(1 - F(v_p) - P[v \ge v_p]\bigr)\,\tau}{1 - F(v_p)}.$$

Note that $E[t \mid v = v_p \text{ for } \mathrm{Comp}[v \ge v_p]]$ is the average of the scheduled replacement times at $v = v_p$ over $\mathrm{Comp}[v \ge v_p]$.
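The Weibull fit to $\mathrm{Comp}[v < v_p]$ has to treat components that have not failed as right-censored. Below is a minimal pure-Python sketch of the standard censored maximum-likelihood fit. The function name `fit_weibull_censored`, the bisection bracket, and the synthetic data are our own illustrative assumptions; the text does not specify how its Weibull fit was computed.

```python
import math
import random
from typing import Sequence, Tuple

def fit_weibull_censored(times: Sequence[float], failed: Sequence[bool],
                         lo: float = 0.05, hi: float = 50.0,
                         iters: int = 100) -> Tuple[float, float]:
    """Maximum-likelihood Weibull (shape, scale) for right-censored runtimes.

    For a fixed shape k the scale MLE is closed form, scale**k = sum(t**k) / r
    (r = number of failures). Substituting it leaves one equation in k,
        1/k + mean(log t over failures) - sum(t**k log t) / sum(t**k) = 0,
    whose left side decreases in k, so bisection on [lo, hi] finds the root
    (assuming the root lies inside the bracket).
    """
    t_max = max(times)
    u = [t / t_max for t in times]          # rescale to (0, 1] to avoid overflow in u**k
    log_u = [math.log(x) for x in u]        # all times must be positive
    log_fail = [lu for lu, f in zip(log_u, failed) if f]
    r = len(log_fail)
    if r == 0:
        raise ValueError("at least one failure is required")
    mean_log_fail = sum(log_fail) / r

    def score(k: float) -> float:
        w = [x ** k for x in u]
        return 1.0 / k + mean_log_fail - sum(wi * lu for wi, lu in zip(w, log_u)) / sum(w)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    k = 0.5 * (lo + hi)
    scale = t_max * (sum(x ** k for x in u) / r) ** (1.0 / k)
    return k, scale

# Synthetic check: true shape 2, scale 1000, random right-censoring (hypothetical data).
rng = random.Random(0)
true_t = [rng.weibullvariate(1000.0, 2.0) for _ in range(4000)]   # (scale, shape)
censor = [rng.uniform(0.0, 2000.0) for _ in range(4000)]
observed = [min(a, c) for a, c in zip(true_t, censor)]
is_failure = [a <= c for a, c in zip(true_t, censor)]
k_hat, scale_hat = fit_weibull_censored(observed, is_failure)
print(k_hat, scale_hat)
```

Profiling out the scale reduces the fit to a one-dimensional root search, which keeps the sketch free of external optimizers.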

Alternatively, we may assume that components in $\mathrm{Comp}[v < v_p]$ that would fail after a time $t_c$ account for the scheduled replacements in the difference ($= 1 - F(v_p) - P[v \ge v_p]$), whereas components in $\mathrm{Comp}[v < v_p]$ that would fail before $t_c$ are failure-replaced. We can estimate $t_c$ from the constraint $F(v_p) = F_{v<v_p}(t_c)\,(1 - P[v \ge v_p])$. That is, the total expected probability of failure replacements over all components ($= F(v_p)$) should equal the actual ratio of the number of components in $\mathrm{Comp}[v < v_p]$ to the total number of components in the dataset ($= 1 - P[v \ge v_p]$), multiplied by the total expected probability of failure before $t_c$ over $\mathrm{Comp}[v < v_p]$ ($= F_{v<v_p}(t_c)$). Thus,

$$t_f = \text{expected time to failure replacement} = t_c - \frac{\int_0^{t_c} F_{v<v_p}(t)\,dt}{F_{v<v_p}(t_c)}.$$

Then, the mean scheduled replacement time over the components corresponding to $1 - F(v_p) - P[v \ge v_p]$ is denoted by $\tau$ and estimated as

$$\tau = \frac{\int_0^{\infty} S_{v<v_p}(t)\,dt - t_f\,F_{v<v_p}(t_c)}{1 - F_{v<v_p}(t_c)},$$

$$t_p = \text{expected time to scheduled replacement} = \frac{P[v \ge v_p]\;E\bigl[t \mid v = v_p \text{ for } \mathrm{Comp}[v \ge v_p]\bigr] + \bigl(1 - F(v_p) - P[v \ge v_p]\bigr)\,\tau}{1 - F(v_p)}.$$
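Under this alternative assumption, and with a Weibull fit for $F_{v<v_p}(t)$, every quantity can be computed in closed form or by one numerical integral. The sketch below follows the equations above step by step. All input values are hypothetical, and the function name `alternative_assumption_times` is ours; the Weibull inverse CDF and the identity $\int_0^\infty S(t)\,dt = \text{scale}\cdot\Gamma(1 + 1/\text{shape})$ are standard Weibull facts.

```python
import math

def weibull_cdf(t: float, shape: float, scale: float) -> float:
    return 1.0 - math.exp(-((t / scale) ** shape))

def alternative_assumption_times(f_vp: float, p_reach: float, mean_t_at_vp: float,
                                 shape: float, scale: float, steps: int = 4000):
    """Return (t_c, t_f, tau, t_p) under the alternative assumption.

    f_vp = F(v_p), p_reach = P[v >= v_p],
    mean_t_at_vp = E[t | v = v_p for Comp[v >= v_p]],
    (shape, scale) = Weibull fit giving F_{v<v_p}(t) for Comp[v < v_p].
    """
    q = f_vp / (1.0 - p_reach)                # F_{v<v_p}(t_c) from the constraint
    if not 0.0 < q < 1.0 or f_vp + p_reach > 1.0:
        raise ValueError("inputs are inconsistent with the constraint")
    t_c = scale * (-math.log(1.0 - q)) ** (1.0 / shape)   # invert the Weibull CDF
    # t_f = t_c - integral_0^{t_c} F_{v<v_p}(t) dt / F_{v<v_p}(t_c), trapezoidal rule
    h = t_c / steps
    integral_f = sum(0.5 * h * (weibull_cdf(i * h, shape, scale)
                                + weibull_cdf((i + 1) * h, shape, scale))
                     for i in range(steps))
    t_f = t_c - integral_f / q
    # integral_0^inf S_{v<v_p}(t) dt equals the Weibull mean, scale * Gamma(1 + 1/shape)
    mean_life = scale * math.gamma(1.0 + 1.0 / shape)
    tau = (mean_life - t_f * q) / (1.0 - q)
    t_p = (p_reach * mean_t_at_vp + (1.0 - f_vp - p_reach) * tau) / (1.0 - f_vp)
    return t_c, t_f, tau, t_p

# Hypothetical inputs, for illustration only.
t_c, t_f, tau, t_p = alternative_assumption_times(
    f_vp=0.1, p_reach=0.6, mean_t_at_vp=9000.0, shape=2.5, scale=10000.0)
print(t_c, t_f, tau, t_p)
```

As expected from the definitions, $t_f$ falls below $t_c$ and $\tau$ lies above it, since $\tau$ averages only the failures that would occur after $t_c$.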

For this vital sign-based replacement policy, we choose $v_p$ to minimize the average maintenance cost per unit runtime:

$$\begin{aligned}
\text{average maintenance cost per unit runtime}
&= \frac{\text{average maintenance cost per cycle}}{\text{average runtime per cycle}} \\
&= \frac{(C_f + c_d\,DT_f)\,F(v_p) + (C_p + c_d\,DT_p)\,(1 - F(v_p))}{t_f\,F(v_p) + t_p\,(1 - F(v_p))} \\
&= \bigl(C_f - C_p + c_d(DT_f - DT_p)\bigr)\,\frac{F(v_p)}{\lambda} + \frac{C_p + c_d\,DT_p}{\lambda}
\end{aligned}$$

where $\lambda$ = average runtime per cycle $= t_f\,F(v_p) + t_p\,(1 - F(v_p))$.


Figure 4. Vital sign indicator functions steeply increasing around $v_p$: no strong trade-off between the average runtime per cycle and $F(v_p)$ (= total in-field failure probability) (vertical axis: vital sign indicator)

As in the analysis of the runtime-based policy, the optimization goal of minimizing the average maintenance cost per unit runtime is achieved by increasing the average runtime per cycle ($= t_f F(v_p) + t_p(1 - F(v_p))$) and decreasing the in-field failure probability per cycle, $F(v_p)$. However, in contrast to the runtime-based policy, with vital sign indicator functions steeply increasing around $v_p$ there is no strong trade-off between decreasing $F(v_p)$ and increasing the average runtime per cycle. In other words, decreasing $F(v_p)$, which would mean fewer failure replacements, can be obtained by decreasing $v_p$, but this does not necessarily lead to a large decrease of $t_p$ (the average of scheduled replacement times at $v_p$) when the vital sign indicator functions are steeply increasing around $v_p$ (compared with slowly increasing functions).

More importantly, consider the definitions of $t_p$ (involving the term $E[t \mid v = v_p \text{ for } \mathrm{Comp}[v \ge v_p]]$) and $t_f$ (involving $S_{v<v_p}(t)$ or $F_{v<v_p}(t)$). If decreasing $v_p$ allows failures that happen later in time to be schedule-replaced, this tends to increase both $t_p$ and $t_f$, as well as to decrease $F(v_p)$; thus, it helps the optimization goal. If instead decreasing $v_p$ allows failures that happen earlier in time to be schedule-replaced, this tends to decrease $t_p$ but still tends to increase $t_f$ and decrease $F(v_p)$. Note that, in contrast to the runtime-based policy, $t_f$ is not necessarily smaller than $t_p$ for a vital sign-based policy; that is, decreasing $t_p$ does not necessarily decrease $t_f$. The values of $t_p$ and $t_f$ at the optimal $v_p$ depend on the complete distribution and the paths in the runtime vs. vital sign indicator 2-dimensional plot.

It is critical to note that the shape of the cumulative failure probability function $F(v')$ for any candidate threshold $v'$ changes with the modeling parameters we use to design a vital sign indicator. Note also that $t_p$ and $t_f$ for any candidate threshold $v'$ (i.e., $t_p(v')$ and $t_f(v')$ as functions of $v'$) depend on the designed vital sign indicator.

Given $C_f$, $C_p$, $c_d$, $DT_f$, $DT_p$, $F(v')$, $t_p(v')$, and $t_f(v')$ for a designed vital sign indicator, the average maintenance cost per unit runtime is a function of $v_p$, denoted by $g$:

$$g\bigl(v_p \mid F(v'), t_p(v'), t_f(v')\bigr) = \frac{(C_f + c_d\,DT_f)\,F(v_p) + (C_p + c_d\,DT_p)\,(1 - F(v_p))}{t_f(v_p)\,F(v_p) + t_p(v_p)\,(1 - F(v_p))}$$

Thus, the value of $g$ at $v_p$ is determined by our design of the vital sign indicator, that is, by what the paths of the vital sign over time look like.


Figure 5. Comparing convex-shaped and concave-shaped vital sign indicator models: (a) convex-shaped vital sign indicator model; (b) concave-shaped vital sign indicator model (axes: runtime vs. vital sign indicator; red circles: failure replacements; blue circles: running or scheduled replacements)


Then, the optimized vital sign threshold value for the scheduled replacement policy using this vital sign indicator is $v_p^* = \arg\min_{v_p} g\bigl(v_p \mid F(v'), t_p(v'), t_f(v')\bigr)$.

We compare the runtime-based component replacement policy with the newly designed vital sign-based replacement policy in terms of the average maintenance cost per unit runtime. That is, we compare $g(t_p^*)$ with $g\bigl(v_p^* \mid F(v'), t_p(v'), t_f(v')\bigr)$. If $g\bigl(v_p^* \mid F(v'), t_p(v'), t_f(v')\bigr) < g(t_p^*)$, the designed vital sign-based replacement policy has a lower cost per unit runtime and is therefore more beneficial in terms of the economic criterion.

4. Building a Vital Sign Indicator Based on Classification and Regression


In Figure 5(a) and (b), we compare two hypothetical vital sign indicator models (convex-shaped and concave-shaped) whose failure probability density functions along the vital sign indicator axis are the same, although this would hardly happen in practice. For the same vital sign threshold value $v_p$, the convex shape in Figure 5(a) would have a greater average runtime to scheduled replacement ($t_p$ = the average of the runtimes at all intersection points) than the concave shape in Figure 5(b). The convex paths would predict upcoming failures near the actual failure times, whereas the concave paths would predict them too early. The concave paths would therefore have a smaller average runtime due to premature replacements. Thus, in general, the convex-shaped vital sign indicator model would be more desirable than the concave-shaped one. This is also why we should examine the complete vital sign paths, not just the shape of the failure probability density function or $F(v_p)$.
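A toy calculation makes the convex-versus-concave comparison concrete. Assume, purely for illustration, power-law paths $v(t) = (t/T)^a$ that reach $v = 1$ exactly at the failure time $T$; $a > 1$ gives a convex path and $a < 1$ a concave one. This path family and all values are hypothetical, not the model of this paper.

```python
def crossing_time(failure_time: float, exponent: float, threshold: float) -> float:
    """Runtime at which the toy path v(t) = (t / failure_time)**exponent reaches threshold."""
    return failure_time * threshold ** (1.0 / exponent)

T_fail = 10000.0   # hypothetical actual failure time (hours)
v_p = 0.5          # hypothetical vital sign threshold
convex = crossing_time(T_fail, 3.0, v_p)         # exponent > 1: convex path
concave = crossing_time(T_fail, 1.0 / 3.0, v_p)  # exponent < 1: concave path
print(convex, concave)  # roughly 7937 and 1250 hours
```

With the same threshold and the same failure time, the convex path triggers replacement at about 79% of the component life, the concave one at about 12.5%.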

Before explaining our vital sign indicator model, we first introduce the notion of an “individualized cumulative failure probability function”. For each individual component, consider a hypothetical population of components that share the same covariate history as that component. We can then define a cumulative distribution function of the failure time for this population, which we call the individualized cumulative failure probability function of the component. The individualized cumulative failure probability function $F_j(t)$ of component $j$ is related to the individualized cumulative hazard function $H_j(t)$ by

$$F_j(t) = 1 - S_j(t) = 1 - \exp\bigl(-H_j(t)\bigr),$$

where $S_j(t)$ is the individualized survival probability function.

In this paper we model the vital sign indicator using the individualized cumulative failure probability function. That is, the vital sign indicator of a component is its individualized cumulative failure probability over runtime.

In the runtime-based policy, we select the best scheduled replacement time so that the cumulative failure probability $F(t_p)$ optimizes the economic criterion. In contrast, in the vital sign-based policy for scheduled replacements, we apply a selected vital sign threshold value to the individualized cumulative failure probability functions $F_j(t)$ of the components. Because $F_j(t) = 1 - \exp(-H_j(t))$ is strictly increasing in $H_j(t)$, this is the same as applying a common threshold to the individualized cumulative hazard functions $H_j(t)$. Note that this individualization of the cumulative failure probability (or cumulative hazard) is critical, since it allows each component to have its own transformed time scale for the replacement policy.
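The equivalence between a threshold $v^*$ on $F_j$ and a threshold on $H_j$ follows from inverting $F = 1 - e^{-H}$: the matching hazard threshold is $-\ln(1 - v^*)$. The sketch below checks on a hypothetical cumulative hazard path that both thresholds are first reached at the same sample; the function name `hazard_threshold` and the path values are illustrative assumptions.

```python
import math

def hazard_threshold(v_star: float) -> float:
    """Threshold on H_j equivalent to threshold v* on F_j = 1 - exp(-H_j)."""
    return -math.log(1.0 - v_star)

H_path = [0.0, 0.05, 0.2, 0.45, 0.9, 1.6]        # hypothetical H_j at increasing runtimes
F_path = [1.0 - math.exp(-h) for h in H_path]    # corresponding vital sign values F_j

v_star = 0.5
h_star = hazard_threshold(v_star)                # equals ln 2
first_f = next(i for i, f in enumerate(F_path) if f >= v_star)
first_h = next(i for i, h in enumerate(H_path) if h >= h_star)
print(first_f, first_h)  # both thresholds are first reached at the same sample
```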

The individualized cumulative hazard $H_j(t)$ measures the total accumulated risk that component $j$ has faced from the start of its life until the present, while the (instantaneous) hazard rate measures the risk that a component that has not yet failed will fail within a unit of runtime (Singer & Willett, 2003). Compared with using the hazard rate to design a scheduled replacement policy, using the individualized cumulative hazard $H_j(t)$ has some advantages. First, in contrast to the hazard rate, the individualized cumulative hazard may capture the accumulated wear and tear over the component runtime. Second, the individualized cumulative hazard is always monotonically increasing, whereas the hazard rate may fluctuate up and down over the runtime. Monotonicity is necessary because the vital sign indicator is conceptualized as a transformed time scale. In addition, it matches the common intuition that accumulated wear and tear always increases over the runtime, that is, that the condition of a component worsens with runtime.

Because our dataset consists of daily-interval samples, we define the daily hazard $h_j(d)$ on date $d$ for component $j$ as the total hazard accumulated during that day's runtime; that is, daily hazard = hazard rate × daily runtime. We can then estimate the individualized cumulative hazard by summing all daily hazards up to the present time $t$:

$$H_j(t) = \sum_{d \,:\, \mathrm{Meter}(j,d) \le t} h_j(d),$$

where $\mathrm{Meter}(j,d)$ is the accumulated runtime hours of component $j$ over all days up to and including date $d$.
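The daily-hazard bookkeeping above can be sketched as follows. The hazard rates would come from the fitted regression model; here they, the daily runtimes, and the helper names `daily_hazards` and `cumulative_hazard` are hypothetical. Note that a day with zero runtime adds nothing to the meter or to the cumulative hazard.

```python
from itertools import accumulate
from typing import List, Sequence

def daily_hazards(hazard_rate: Sequence[float], daily_runtime: Sequence[float]) -> List[float]:
    """h_j(d) = hazard rate x daily runtime, one value per date d."""
    return [rate * hours for rate, hours in zip(hazard_rate, daily_runtime)]

def cumulative_hazard(daily_runtime: Sequence[float], h: Sequence[float], t: float) -> float:
    """H_j(t): sum of h_j(d) over dates d with Meter(j, d) <= t."""
    meter = list(accumulate(daily_runtime))   # Meter(j, d), up to and including date d
    return sum(hd for m, hd in zip(meter, h) if m <= t)

# Hypothetical five days of one component: runtime hours and model hazard rates (per hour).
runtime = [20.0, 18.0, 22.0, 0.0, 20.0]
rate = [1e-4, 1e-4, 2e-4, 2e-4, 3e-4]
h = daily_hazards(rate, runtime)
print(cumulative_hazard(runtime, h, 60.0), cumulative_hazard(runtime, h, 80.0))
```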

Figure 6. An example of the “designed” daily hazard as a regression target variable (vertical axis: daily hazard)

It is important to note that the “estimated” daily hazard depends on our selection of covariates and model. Daily hazard estimates from a desirable model would also predict the failure of a component near its actual failure date. Wrong or premature failure predictions would reduce the average runtime. Thus, it is better to find covariates and a model that make the daily hazard estimates convex-shaped and very close to the maximum value (= 1) near the actual failure date (e.g., Figure 6). In practice, however, we do not require the daily hazard estimates to be convex-shaped, because this may not be achievable with our selected features and modeling choice. We only want the individualized cumulative hazards to satisfy certain desired characteristics for the economic criterion: monotonically increasing, high values of $t_p$ and $t_f$, and high vital sign values at the failure times. Thus, we set up the design of a vital sign indicator model as a regression task in which the regression target variable is the “designed” daily hazard $h_j(d)$ that we specify on any date $d$ for component $j$ as follows:

-  If the component was failure-replaced, $h_j(d) = \left(\mathrm{Meter}(j,d)/\mathrm{Meter}(j,T_F(j))\right)^{\alpha}$, where $\mathrm{Meter}(j,d)$ is the total runtime hours up to and including date d, $T_F(j)$ is the final observed date (i.e., the replacement date), and $\alpha \ge 1$.

-  If the component was schedule-replaced or is still running, $h_j(d) = \beta\left(\mathrm{Meter}(j,d)/M_{\max}\right)^{\alpha}$, where $M_{\max} = \max_i \mathrm{Meter}(i,T_F(i))$ is the maximum total runtime hours over all components in the dataset, and $\beta \ll 1$ is a small positive number close to 0 (e.g., $\beta = 0.1$).
That is, as shown in Figure 6, the first equation ensures that failure-replaced components reach the maximum value (= 1) on the actual failure date, while the second keeps the designed daily hazard of running and schedule-replaced components low over their entire runtime.
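A minimal sketch of the two target definitions, assuming each component's `meter` list runs from its start date to its final observation date T_F(j). The function and argument names are illustrative, and the defaults α = 1 and β = 0.1 are simply the values reported in the case study, not requirements of the method.

```python
from typing import List, Sequence


def designed_daily_hazard(
    meter: Sequence[float],
    failure_replaced: bool,
    m_max: float,
    alpha: float = 1.0,
    beta: float = 0.1,
) -> List[float]:
    """Regression target h_j(d) for every date d of one component j.

    meter            : Meter(j, d) per date, ending at T_F(j)
    failure_replaced : True if component j ended in a failure replacement
    m_max            : M_max, the largest Meter(i, T_F(i)) over all components
    """
    if failure_replaced:
        m_final = meter[-1]  # Meter(j, T_F(j))
        if m_final <= 0:
            raise ValueError("a failure-replaced component needs positive runtime")
        return [(m / m_final) ** alpha for m in meter]  # reaches 1 at T_F(j)
    return [beta * (m / m_max) ** alpha for m in meter]  # stays at or below beta
```

Because M_max is the largest final meter reading in the dataset, the second branch can never exceed β, which keeps censored components well below the failure-replaced ones.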

We build vital sign indicator models by solving the regression task with differently designed daily hazards (different values of $\alpha$ and $\beta$), and we select the best model with respect to the economic optimization criterion, estimated by leave-one-component-out cross-validation. We describe this procedure in detail below.

Given the list of past replaced components (failure or scheduled replacements) and currently running ones for a component type across a fleet of equipment, together with the corresponding time-stamped logs of runtime hours (meter), total fuel consumption, total work (load) and sensor events, we propose the following regression-based framework for building a vital sign indicator for that component type.

Suppose that there are J components of the target type in total, each either replaced in the past or still running. For component j (= 1, …, J), the service start date is $T_S(j)$ and the final observation date is $T_F(j)$, defined as the replacement date for past components and as the last observed date for running components. The overall dataset then consists of all points x(j,d) over components j (= 1, …, J) and dates d (= $T_S(j)$, …, $T_F(j)$).

Input data:

From the service start date of component j,

■  Meter(j,d) = accumulated runtime hours over days up to and including date d

■  Fuel(j,d) = accumulated fuel consumption over days up to and including date d

■  Load(j,d) = accumulated number of loads (total work) over days up to and including date d

■  EventCount(j,d) = accumulated number of relevant sensor events for the target component type over days up to and including date d

Note that $\mathrm{Meter}(j,T_S(j)) = \mathrm{Fuel}(j,T_S(j)) = \mathrm{Load}(j,T_S(j)) = \mathrm{EventCount}(j,T_S(j)) = 0$. Here we assume that the relevant sensor event types for the component type are selected by a significance test in a univariate Cox proportional hazards model for each event type (Hastie, Tibshirani, Friedman, & Franklin, 2005; Bair, Hastie, Paul, & Tibshirani, 2006). Other techniques, such as frequent sequence mining (Zaki, 2001) on component failure and event data, can also be used for the same purpose.
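The text does not say which test statistic was used in the univariate Cox model. The sketch below assumes the score test (with Breslow handling of ties) for one static covariate per component, for example the component's count of a single event type per runtime hour; both the statistic and the covariate are assumptions made for illustration. An event type would then be kept when its p-value falls below a chosen significance level.

```python
import math
from typing import Sequence


def cox_score_test_pvalue(
    times: Sequence[float], failed: Sequence[bool], x: Sequence[float]
) -> float:
    """p-value of the score test of H0: beta = 0 in a univariate Cox PH model.

    times  : runtime at the end of each component's history
    failed : True for a failure replacement, False if right-censored
             (scheduled replacement or still running)
    x      : one static covariate value per component
    Tied times are handled Breslow-style: all tied components stay at risk.
    """
    score, info = 0.0, 0.0
    for t_i, failed_i, x_i in zip(times, failed, x):
        if not failed_i:
            continue
        at_risk = [x_k for t_k, x_k in zip(times, x) if t_k >= t_i]
        mean = sum(at_risk) / len(at_risk)
        mean_sq = sum(v * v for v in at_risk) / len(at_risk)
        score += x_i - mean           # contribution to the score U(0)
        info += mean_sq - mean ** 2   # contribution to the information I(0)
    if info <= 0.0:
        return 1.0  # covariate carries no information at the failure times
    chi2 = score ** 2 / info                 # ~ chi-square(1) under H0
    return math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square(1)
```

The score test was chosen here only because it needs no iterative fitting; a Wald or likelihood-ratio test from a fitted Cox model would serve the same screening purpose.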

Given the parameters

$N_{\text{smooth}}$ = positive integer window length (in days) of a smoothing filter,

$N_{\text{fuel}}$ = positive real threshold for counting the number of dates with a high daily fuel rate,

$N_{\text{load}}$ = positive real threshold for counting the number of dates with a high daily load rate,

we compute the intermediate variables below, which are then used to calculate features. The purpose of $N_{\text{fuel}}$ and $N_{\text{load}}$ is to count outliers. Although we present this simple rule-based outlier detection here, our framework allows more sophisticated anomaly detection algorithms to be applied for more effective feature generation.

Intermediate variables:

■  DailyMeter(j,d) = daily meter hours on date d = Meter(j,d) − Meter(j,d−1)

■  DailyFuel(j,d) = daily fuel consumption on date d = Fuel(j,d) − Fuel(j,d−1)

■  DailyLoad(j,d) = daily number of loads on date d = Load(j,d) − Load(j,d−1)

■  SmoothedDailyMeter(j,d) = average daily meter hours over the past $N_{\text{smooth}}$ days on date d

■  SmoothedDailyFuel(j,d) = average daily fuel consumption over the past $N_{\text{smooth}}$ days on date d

■  SmoothedDailyLoad(j,d) = average daily number of loads over the past $N_{\text{smooth}}$ days on date d

■  DailyFuelRate(j,d) = SmoothedDailyFuel(j,d) / SmoothedDailyMeter(j,d)

■  DailyLoadRate(j,d) = SmoothedDailyLoad(j,d) / SmoothedDailyMeter(j,d)

■  HighFuelRateCount(j,d) = accumulated count of days on which DailyFuelRate > $N_{\text{fuel}}$, over days up to and including date d

■  HighLoadRateCount(j,d) = accumulated count of days on which DailyLoadRate > $N_{\text{load}}$, over days up to and including date d
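The intermediate variables can be sketched for one component as below. Three details are assumptions not fixed by the text: the start date T_S(j) gets a daily increment of 0, the smoothing window includes date d itself (shorter at the start of service), and days whose smoothed meter hours are 0 are given a rate of 0. All function names are illustrative.

```python
from typing import Dict, List, Sequence


def daily_increments(cumulative: Sequence[float]) -> List[float]:
    """X(j, d) - X(j, d-1); the start date T_S(j) is given an increment of 0."""
    return [0.0] + [cumulative[d] - cumulative[d - 1] for d in range(1, len(cumulative))]


def trailing_mean(values: Sequence[float], n_smooth: int) -> List[float]:
    """Average over the past n_smooth days, including date d itself."""
    out = []
    for d in range(len(values)):
        window = values[max(0, d - n_smooth + 1): d + 1]
        out.append(sum(window) / len(window))
    return out


def high_rate_count(num: Sequence[float], den: Sequence[float], threshold: float) -> List[int]:
    """Accumulated count of days on which the smoothed rate num/den exceeds threshold."""
    count, out = 0, []
    for a, b in zip(num, den):
        rate = a / b if b > 0 else 0.0  # assumption: days without runtime have rate 0
        if rate > threshold:
            count += 1
        out.append(count)
    return out


def intermediate_variables(meter, fuel, load, n_smooth, n_fuel, n_load) -> Dict[str, List]:
    """Smoothed meter hours and high-rate counts from cumulative per-date logs."""
    s_meter = trailing_mean(daily_increments(meter), n_smooth)
    s_fuel = trailing_mean(daily_increments(fuel), n_smooth)
    s_load = trailing_mean(daily_increments(load), n_smooth)
    return {
        "SmoothedDailyMeter": s_meter,
        "HighFuelRateCount": high_rate_count(s_fuel, s_meter, n_fuel),
        "HighLoadRateCount": high_rate_count(s_load, s_meter, n_load),
    }
```

Note that the comparison is strict, matching the "> N_fuel" and "> N_load" rule: a smoothed rate exactly equal to the threshold is not counted.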

Before the regression task, we perform a classification task to estimate the probability that the component fails within the next M runtime hours from each date. This estimated failure probability serves as a key predictor variable in the subsequent regression task; we observed that including it improved the fit to the designed daily hazard.

For the classification task, we now explain how to compute features and assign labels for modeling the failure probability.

Features for the classification task:

■  HighFuelRateCountPerMeter(j,d) = HighFuelRateCount(j,d) / Meter(j,d)

■  HighLoadRateCountPerMeter(j,d) = HighLoadRateCount(j,d) / Meter(j,d)

■  TotalFuelRate(j,d) = Fuel(j,d) / Meter(j,d)

■  TotalLoadRate(j,d) = Load(j,d) / Meter(j,d)

■  TotalEventRate(j,d) = EventCount(j,d) / Meter(j,d)
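A sketch of the five features for one component, taking the cumulative logs and the high-rate counts as per-date lists. Since every feature divides by Meter(j,d), which is 0 on the start date T_S(j), this sketch simply skips dates with zero accumulated runtime; that handling is an assumption, as the text does not specify it. `FEATURE_NAMES` and `classification_features` are illustrative names.

```python
from typing import List, Sequence, Tuple

FEATURE_NAMES = (
    "HighFuelRateCountPerMeter",
    "HighLoadRateCountPerMeter",
    "TotalFuelRate",
    "TotalLoadRate",
    "TotalEventRate",
)


def classification_features(
    meter: Sequence[float],
    fuel: Sequence[float],
    load: Sequence[float],
    events: Sequence[float],
    high_fuel: Sequence[int],
    high_load: Sequence[int],
) -> List[Tuple[int, Tuple[float, ...]]]:
    """(date index, feature vector x(j, d)) for every date with Meter(j, d) > 0.

    All inputs are per-date values for one component j, in the order of
    FEATURE_NAMES once divided by Meter(j, d).
    """
    rows = []
    for d, (m, f, ld, e, hf, hl) in enumerate(
        zip(meter, fuel, load, events, high_fuel, high_load)
    ):
        if m > 0:  # skip dates such as T_S(j) where Meter(j, d) = 0
            rows.append((d, (hf / m, hl / m, f / m, ld / m, e / m)))
    return rows
```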
Label assignment for the classification task:

We assign a classification label L(j,d) to each point x(j,d) corresponding to date d of component j, where x(j,d) is the multi-dimensional vector of classification features. In the historical replacement data, each past component ends on its final observation date with one of two types of replacement: scheduled replacement or in-field failure replacement. Since the goal of the classification task is to estimate the probability of failure within the next M runtime hours from each date d, we use binary labels with the classes Failure and No Failure:

■  For a point x(j,d) on a failure-replaced component j, L(j,d) is assigned the Failure class when Meter(j,d) is within M meter hours of the failure replacement (that is, $\mathrm{Meter}(j,d) > \mathrm{Meter}(j,T_F(j)) - M$); otherwise, L(j,d) is assigned the No Failure class.

■  For any point x(j,d) on a schedule-replaced component j, L(j,d) is assigned the No Failure class.

■  For any point x(j,d) on a running component j, L(j,d) is assigned the No Failure class.
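The three labeling rules can be sketched as a single function per component. Encoding the outcome as the strings "failure", "scheduled" and "running", and the classes as 1/0, are assumptions made for this illustration.

```python
from typing import List, Sequence


def failure_labels(meter: Sequence[float], outcome: str, m_hours: float) -> List[int]:
    """Label L(j, d) for every date of component j (1 = Failure, 0 = No Failure).

    meter   : Meter(j, d) per date, ending at the final observation date T_F(j)
    outcome : "failure", "scheduled" or "running"
    m_hours : the horizon M in runtime hours
    """
    if outcome not in ("failure", "scheduled", "running"):
        raise ValueError(f"unknown outcome: {outcome!r}")
    if outcome != "failure":
        return [0] * len(meter)   # scheduled or running: always No Failure
    cutoff = meter[-1] - m_hours  # Meter(j, T_F(j)) - M
    return [1 if m > cutoff else 0 for m in meter]
```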

To measure the performance of our model, we propose and use leave-one-component-out cross-validation. For each run, corresponding to a component j (= 1, …, J), we split the overall dataset into a test dataset containing all points from component j and a training dataset containing all points from the J − 1 remaining components k (≠ j), build a vital sign indicator model on the training dataset only, and compute the vital sign indicator values for all points in the test dataset. In total there are J runs, and in the run corresponding to component j we perform the steps below.
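The splitting scheme can be sketched as a generator that yields one run per component. Storing the points in a dictionary keyed by a component id is an assumption of this sketch; the point objects themselves are left generic.

```python
from typing import Dict, Iterator, List, Sequence, Tuple, TypeVar

Point = TypeVar("Point")


def leave_one_component_out(
    points: Dict[str, Sequence[Point]],
) -> Iterator[Tuple[str, List[Point], List[Point]]]:
    """Yield (held-out component j, training points, test points), one run per component.

    points maps a component id to all of its points x(j, d).
    """
    for j, test_points in points.items():
        train = [p for k, pts in points.items() if k != j for p in pts]
        yield j, train, list(test_points)
```

Splitting by component rather than by point keeps all dates of the held-out component out of training, so the highly correlated daily samples of one component cannot leak into its own evaluation.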

Initial parameters: $\alpha$ and $\beta$ (designing daily hazards), $N_{\text{smooth}}$, $N_{\text{fuel}}$, $N_{\text{load}}$ (computing features), and M (modeling the failure probability)

Step 1. Divide the overall dataset into the test dataset of all points from one component j and the training dataset of all points from the remaining components.

Step 2. Using only the training dataset, build a binary classifier (e.g., by applying Support Vector Classification (Cristianini & Shawe-Taylor, 2000)) and use it to compute the failure probability (= probability of the Failure class) on each point x(k,d), denoted $P_{\text{failure}}(k,d)$. This estimated probability can be viewed as the probability of failure within the next M runtime hours from date d.

Step 3. Design the target variable for the regression task. The regression target $h_k(d)$ for any component k (≠ j) in the training dataset should have the desired characteristics of the daily hazard, such as being monotonically increasing, being convex-shaped, and attaining its maximum value at failure.


Step 4. Using only the training dataset, build the regression model (e.g., by applying Support Vector Regression (Schölkopf & Smola, 2002)) for the target daily hazard $h_k(d)$, with feature variables such as Meter(k,d), Fuel(k,d), Load(k,d), EventCount(k,d) and $P_{\text{failure}}(k,d)$.

Step 5. Apply the built regression model to obtain the estimated daily hazard $h_j(d)$ for each point x(j,d) of component j in the test dataset.

Step 6. Compute the individualized cumulative hazard of component j, $H_j(t) = \sum_{d \,:\, \mathrm{Meter}(j,d) \le t} h_j(d)$.

Step 7. Compute the individualized cumulative failure probability of component j, $F_j(t) = 1 - \exp(-H_j(t))$.
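Steps 6 and 7 turn the regression outputs of Step 5 into a vital sign path for the held-out component, as sketched below. Clipping negative regression outputs to 0 is an added assumption (a regressor such as SVR is not constrained to be non-negative), included so that the cumulative hazard, and hence the vital sign, never decreases.

```python
import math
from typing import List, Sequence


def vital_sign_path(h_pred: Sequence[float]) -> List[float]:
    """F_j on each date of the held-out component j, from predicted daily hazards.

    h_pred holds the regression outputs for the dates of component j in date
    order; the running sum up to date d equals H_j(Meter(j, d)) whenever the
    meter reading increases every day. Negative predictions are clipped to 0
    (an assumption, not part of the steps above).
    """
    path, cum_hazard = [], 0.0
    for h in h_pred:
        cum_hazard += max(h, 0.0)                 # Step 6: H_j(t)
        path.append(1.0 - math.exp(-cum_hazard))  # Step 7: F_j(t) = 1 - exp(-H_j(t))
    return path
```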

After all J runs of leave-one-component-out cross-validation, we obtain the vital sign indicator values for all components. Given these values, we solve an optimization task to obtain the optimal threshold value for the replacement policy with respect to an economic criterion such as the average maintenance cost per unit runtime. In a threshold-based replacement policy, a component is replaced when its vital sign indicator value reaches the threshold. Optionally, the estimated optimal threshold can be used to normalize the vital sign indicator, so that a component is replaced when its vital sign reaches 100% wear.
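The threshold search can be sketched with a deliberately naive plug-in estimate of the cost per unit runtime computed from the held-out vital sign paths. This is not the estimator used in the paper (which estimates F(v_p), t_p and t_f as described in earlier sections); in particular it ignores censoring corrections, counting the runtime of censored components that never reach the threshold without any replacement cost. The `History` class and function names are hypothetical, and the per-event costs are meant to include downtime (e.g., C_f + c_d·DT_f for a failure).

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class History:
    path: List[Tuple[float, float]]  # (runtime hours, vital sign value) per date, ascending
    failure_replaced: bool           # True if the observed history ended in a failure


def cost_per_runtime(histories: Sequence[History], v: float,
                     cost_failure: float, cost_scheduled: float) -> float:
    """Naive cost per unit runtime if every component were replaced at vital sign v."""
    total_cost, total_runtime = 0.0, 0.0
    for hist in histories:
        crossing = next((r for r, s in hist.path if s >= v), None)
        if crossing is not None:       # the policy replaces it on schedule here
            total_cost += cost_scheduled
            total_runtime += crossing
        else:
            total_runtime += hist.path[-1][0]
            if hist.failure_replaced:  # failure was not pre-empted by the policy
                total_cost += cost_failure
            # otherwise right-censored: runtime counted, no replacement cost
    return total_cost / total_runtime if total_runtime > 0 else float("inf")


def best_threshold(histories, candidates, cost_failure, cost_scheduled) -> float:
    """Candidate threshold with the lowest naive cost per unit runtime."""
    return min(candidates,
               key=lambda v: cost_per_runtime(histories, v, cost_failure, cost_scheduled))
```

Using the leave-one-component-out paths, rather than in-sample predictions, avoids choosing a threshold that only looks good on components the model was trained on.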

In general, the choice of parameters ($\alpha$, $\beta$, $N_{\text{smooth}}$, $N_{\text{fuel}}$, $N_{\text{load}}$, M) influences the resulting model. We therefore need to search for the parameters that yield the best vital sign indicator model with respect to our optimization criterion.
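One straightforward way to carry out this search is an exhaustive grid search, sketched below. The `evaluate` callback is hypothetical: it stands for running the whole leave-one-component-out procedure plus the threshold optimization for one parameter setting and returning the optimized cost per unit runtime. The text does not state which search strategy the authors used.

```python
import itertools
from typing import Callable, Dict, Iterable, Optional, Tuple

Params = Dict[str, float]


def grid_search(grid: Dict[str, Iterable[float]],
                evaluate: Callable[[Params], float]) -> Tuple[Optional[Params], float]:
    """Exhaustive search for the parameter setting with the lowest criterion value."""
    names = list(grid)
    best, best_cost = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        cost = evaluate(params)  # hypothetical: full LOCO run + threshold optimization
        if cost < best_cost:
            best, best_cost = params, cost
    return best, best_cost
```

Each setting requires J model fits, so the grid size directly multiplies the cost of the cross-validation.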

5.  Case  study 

Our proposed framework for building the vital sign indicator and optimizing economic performance was tested with one of the largest mining service companies in the world. The collected data include logs of daily fuel consumption, daily number of loads moved, daily meter hours, sensor event data, and component replacement history for 50 mining haul trucks over the period from January 1, 2007 to November 11, 2012. Each truck is equipped with a set of sensors that trigger events on a variety of vital machine conditions. The estimated overall cost of downtime for one of these haul trucks is about 1.5 million USD per day, so the financial impact of reducing downtime is very large. Not only is the scheduled maintenance cost high, but the total cost of an unscheduled in-field failure is even higher: when one piece of equipment breaks down, it not only stops its own production but may also block other equipment from producing. The goal of our vital sign indicator is to optimize the trade-off between scheduled replacement cost and unscheduled failure cost, achieving a lower total maintenance cost for the enterprise.




Annual  Conference  of  the  Prognostics  and  Health  Management  Society  2014 


Figure 7. The individualized cumulative hazard and the vital sign indicator versus runtime: (a) and (b) from the SVC+SVR model, (c) and (d) from the SVC+Cox model. Red = failure replacements, green = scheduled replacements, and blue = running at the time of data collection.


Figure 8. (a) Designed daily hazard ($\alpha = 1$, $\beta = 0.1$), (b) survival probability in runtime (KM, Weibull), (c) survival probability in vital sign indicator (KM, loess), (d) survival probability for Comp[$v < v_p$] (KM, loess)


In this section we present our application and results for one specific component type (called “XI”). To use the framework described in the steps above, we must choose a pair of classification and regression algorithms. In general any algorithms can be used for this purpose, but here we mainly present results using Support Vector Classification (SVC) and Support Vector Regression (SVR). We found that these kernel-based algorithms worked better than other basic algorithms, including linear/quadratic discriminant analysis, generalized linear models and Cox PH regression. We also compared vital sign indicator models obtained with different settings of $\alpha$, $\beta$ (designing daily hazards), $N_{\text{smooth}}$, $N_{\text{fuel}}$, $N_{\text{load}}$ (computing features) and M (modeling the failure probability) in terms of our optimization criterion. Here we show the result with the RBF kernel and the best setting in our application: $\alpha = 1$, $\beta = 0.1$, $N_{\text{smooth}} = 60$, $N_{\text{fuel}} = 190$, $N_{\text{load}} = 3.0$, and M = 4890.


Table 1. Comparison between the traditional runtime-based policy and the vital sign indicator-based policy

| | Runtime-based policy | Vital sign-based policy |
|---|---|---|
| Threshold | t_p = 16500 | v_p = 0.50 |
| Total failure probability | F(t_p) = 0.63 | F(v_p) = 0.21 |
| Expected time to scheduled replacement | t_p = 16500 | t_p = 14848 |
| Expected time to failure replacement | t_f = 8708 | t_f = 7311 |
| Avg runtime per cycle | 11592 | 13201 |
| Avg failure replacement cost per unit runtime | $30.6 | $9.3 |
| Avg scheduled replacement cost per unit runtime | $15.6 | $27.9 |
| Avg maintenance cost per unit runtime | $45.6 | $37.2 |
The economic and logistic parameters for the target component type are as follows: C_f = failure replacement cost = $443,600; C_p = scheduled replacement cost = $374,400; c_d = cost per unit downtime of the equipment = $2,000 per hour; DT_f = downtime due to an in-field failure = 64.8 hrs; DT_p = downtime due to a scheduled replacement = 48 hrs. Note that (C_f + c_d·DT_f)/(C_p + c_d·DT_p) = 1.22.
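The ratio quoted above can be checked directly from the listed parameters. The check assumes c_d is charged per hour of downtime, which is consistent with the downtimes being given in hours; the variable names are illustrative.

```python
# Parameters reported above for the target component type.
C_F = 443_600.0  # failure replacement cost ($)
C_P = 374_400.0  # scheduled replacement cost ($)
C_D = 2_000.0    # downtime cost per hour ($/h)
DT_F = 64.8      # downtime per in-field failure (h)
DT_P = 48.0      # downtime per scheduled replacement (h)

failure_event_cost = C_F + C_D * DT_F    # 443,600 + 129,600 = 573,200
scheduled_event_cost = C_P + C_D * DT_P  # 374,400 +  96,000 = 470,400
ratio = failure_event_cost / scheduled_event_cost

print(round(ratio, 2))  # prints 1.22
```

Once downtime is included, a failure costs only about 22% more than a scheduled replacement, which is why the policy must buy a large reduction in failure probability to pay for the earlier scheduled replacements.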

Figures 7(a) and (b) show the individualized cumulative hazard and the vital sign indicator, respectively, for the model based on SVC and SVR. In the figures, each line corresponds to a component. The color of the line and of its end point indicates whether the component had a failure replacement at the end (red), was running at the time of data collection (blue, right-censored) or had a scheduled replacement at the end (green, right-censored).

Figure 8(a) shows the designed daily hazard. The optimized vital sign threshold was 0.50. Using the two different approaches described earlier for estimating t_p and t_f, we obtained nearly identical values of the criterion ($37.1 and $37.2). Figures 8(b), (c) and (d) show the survival probabilities S(t), S(v) and S_{v<v_p}(t), respectively. Since the cumulative failure probability is one minus the survival probability (that is, F(t) = 1 − S(t) and F(v) = 1 − S(v)), note that F(t_p) = 0.63 > F(v_p) = 0.21. This substantial reduction in the total expected failure probability is a necessary condition for a good vital sign indicator. Also, comparing S_{v<v_p}(t) and S(t) in Figures 8(b) and (d), we find that the expected lifetime of Comp[v < v_p] alone is significantly longer than that of all components in the dataset.
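The curves in Figure 8 are labeled KM, i.e., Kaplan-Meier estimates, which handle the right-censored scheduled-replacement and running components. A minimal Kaplan-Meier estimator is sketched below; applying it to the vital sign value at the end of each history (instead of the runtime) is our reading of how S(v) is obtained and should be treated as an assumption. The loess and Weibull smoothing shown in the figure is not reproduced.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], failed: Sequence[bool]) -> List[Tuple[float, float]]:
    """Kaplan-Meier estimate (time, S(time)) at each distinct failure time.

    times  : end-of-history value per component (runtime, or vital sign value)
    failed : True for a failure replacement; False if right-censored
             (scheduled replacement or still running)
    """
    data = sorted(zip(times, failed))
    n_at_risk = len(data)
    survival, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0
        while i < len(data) and data[i][0] == t:  # group tied values
            deaths += int(data[i][1])
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve
```

The cumulative failure probability at any point on the curve is then F = 1 − S, as used in the comparison of F(t_p) and F(v_p).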

Table 1 compares the runtime-based and vital sign-based replacement policies in terms of the average maintenance cost per unit runtime. The vital sign-based policy reduces this cost by roughly 18% (from $45.6 to $37.2) compared with the runtime-based policy. With a vital sign threshold of 0.5, the new policy makes some false failure predictions and therefore incurs a higher average scheduled replacement cost per unit runtime than the runtime-based policy ($27.9 > $15.6). However, its average failure replacement cost per unit runtime is much smaller ($9.3 ≪ $30.6), so overall it outperforms the runtime-based policy.

We also tested Cox PH regression in combination with SVC in our framework. We compared several Cox PH regression models using different selections of features as time-dependent covariates, and observed that the Cox PH regression using the SVC-estimated failure probability as its only time-dependent covariate worked best among them. Figures 7(c) and (d) show the individualized cumulative hazard and the vital sign indicator from this model. However, it still performed slightly worse ($38.0) than the SVR-based model ($37.2). Note that while Cox PH regression considers covariate values only at the observed failure times (i.e., by maximizing the partial likelihood), SVR can use the covariate values at all times (i.e., by maximizing the fit to the complete paths of the designed target daily hazards).
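To illustrate the contrast drawn above, the sketch below evaluates the Cox log partial likelihood for one time-dependent covariate, with ties handled Breslow-style (an assumption). The `covariate` callback is hypothetical; the point is that it is only ever queried at observed failure times, whereas a regression fitted to the designed daily hazard uses the covariates on every date.

```python
import math
from typing import Callable, List, Sequence


def cox_log_partial_likelihood(
    beta: float,
    times: Sequence[float],
    failed: Sequence[bool],
    covariate: Callable[[int, float], float],
) -> float:
    """Cox log partial likelihood with one time-dependent covariate.

    covariate(k, t) returns component k's covariate value at runtime t. It is
    evaluated only at observed failure times, for the failing component and
    for the components still at risk then.
    """
    loglik = 0.0
    for i, (t_i, failed_i) in enumerate(zip(times, failed)):
        if not failed_i:
            continue  # censored histories contribute only through risk sets
        at_risk = [k for k, t_k in enumerate(times) if t_k >= t_i]
        loglik += beta * covariate(i, t_i)
        loglik -= math.log(sum(math.exp(beta * covariate(k, t_i)) for k in at_risk))
    return loglik
```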

6.  Conclusion  and  discussion 

We compared our vital sign indicator-based policy with a traditional runtime-based policy in terms of the average maintenance cost per unit runtime. When the failure replacement cost of a component is extremely high, it is critical to reduce the total number of in-field failures by adopting a policy that decreases the total expected probability of failure. We modeled our vital sign indicator on an “individualized” cumulative failure probability function for each component. As a transformed time scale, this new indicator allows an individualized maintenance plan for each component based on its actual usage. Our case study demonstrates that the new vital sign indicator-based replacement policy can achieve greater economic value in terms of the average maintenance cost per unit runtime.

Future work will include a remaining useful life (RUL) model based on this vital sign indicator, which will involve estimating paths in the two-dimensional plot of runtime versus vital sign indicator. Another future direction is to incorporate constrained regression to make the vital sign indicators suitably convex-shaped, eventually leading to lower optimal costs.